Nonparametric Bayesian Biclustering
Authors: Edward Meeds and Sam Roweis
Abstract
We present a probabilistic block-constant biclustering model that simultaneously clusters the rows and columns of a data matrix. All entries sharing the same row cluster and column cluster form a bicluster. Each cluster is part of a mixture with a nonparametric Bayesian prior, so the number of biclusters is treated as a nuisance parameter and implicitly integrated over during simulation. Missing entries are integrated out of the model entirely, which both bypasses the common requirement of biclustering algorithms that missing values be filled in before analysis and makes the model robust to high rates of missingness. Using a Gaussian model for the density of entries in biclusters yields an efficient sampling algorithm, because the bicluster parameters can be analytically integrated out. We present several inference procedures for sampling cluster indicators, including Gibbs and split-merge moves. We show that our method is competitive with, if not superior to, existing imputation methods, especially at high missing rates, despite imputing constant values for entire blocks of data. We present imputation experiments and exploratory biclustering results.

Edward Meeds and Sam Roweis, Department of Computer Science, University of Toronto

1 Biclustering

Biclustering (also known as co-clustering or two-way clustering) refers to the simultaneous grouping of rows and columns of a data matrix. Each bicluster is a submatrix of the full (possibly reordered) data matrix, and the entries in a bicluster should have some coherent structure (the details of which depend on the method employed). This coherence could be, for example, constant values for all entries in the submatrix, or similar row/column patterns within a bicluster. Biclustering algorithms are also characterized by how rows and columns are assigned to clusters.
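To make the block-constant structure concrete, here is a minimal sketch (not from the paper) of how row and column cluster labels jointly induce biclusters: the bicluster of entry (i, j) is the pair (row label of i, column label of j), and a block-constant model summarizes each block by a single value. The toy data, labels, and `block_means` helper are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data: a 6x4 matrix with block-constant structure plus noise.
rng = np.random.default_rng(0)
row_labels = np.array([0, 0, 1, 1, 1, 2])   # each row belongs to exactly one cluster
col_labels = np.array([0, 1, 1, 0])          # each column belongs to exactly one cluster
means = np.array([[1.0, 5.0],                # one value per (row cluster, column cluster)
                  [3.0, 2.0],
                  [4.0, 4.0]])
X = means[row_labels][:, col_labels] + 0.1 * rng.standard_normal((6, 4))

def block_means(X, row_labels, col_labels):
    """Summarize each bicluster (k, l) by the mean of its entries."""
    K, L = row_labels.max() + 1, col_labels.max() + 1
    M = np.zeros((K, L))
    for k in range(K):
        for l in range(L):
            # np.ix_ selects the submatrix of rows in cluster k and columns in cluster l
            M[k, l] = X[np.ix_(row_labels == k, col_labels == l)].mean()
    return M

print(block_means(X, row_labels, col_labels))  # close to `means` up to noise
```

Because every row and column carries a single label, the blocks tile the (reordered) matrix exactly, matching the non-overlapping setting this paper assumes.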
Rows/columns can either belong to multiple clusters (as shown in Figure 2A & 2B) or to only a single cluster (as shown in 2C); clusters can overlap (2A) or not (2B & 2C). Some matrix entries may also belong to a “background” noise model that is not part of any bicluster (2A & 2B). Most representations assume that there exists a single permutation of the matrix rows/columns after which all the biclusters are contiguous blocks. (Matrix tile analysis [4] is an exception.) Our approach produces biclusters like those in Figure 2C: each row and column belongs to a single, non-overlapping cluster.

Figure 1: Left: Original data. Right: Data after biclustering.

Assessing the significance of partitions discovered by biclustering is problematic for several reasons. First, there are few available data sets annotated with ground-truth partitions. Second, those that are annotated may have partitions that do not correspond to any possible result of the clustering algorithm. Third, most algorithms have parameters that modify the scale/size of the partitions discovered, and deciding which scale is best in a purely unsupervised manner is difficult and poorly defined. For example, a common goal when clustering microarray data is to group genes and/or experiments in such a way that the partitions are biologically “significant” or “plausible”; often this is assessed by examining the clusters by hand [2, 8]. Several modeling issues must be addressed by any biclustering method. The most important is a method for assessing when a partition represents a significant bicluster; this is closely related to the choice of the number of clusters. Most biclustering algorithms use greedy procedures to fit biclusters one at a time until either a global fitting objective or a pre-specified number of clusters has been reached [2, 8].
Of course, if the only objective is to reduce some measure of fitting residual, overfitting will occur unless the model is heavily regularized, especially in very flexible non-probabilistic models. By restricting the permissible types of clusters we can control capacity; we can also use a probabilistic model of the data and let the marginal likelihood be our guide. The biclustering algorithm we present here is a fully probabilistic model which places Bayesian nonparametric priors over row and column clusters. This allows us to treat the number of biclusters as a nuisance parameter and implicitly integrate it out.
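The mechanism behind "integrating out" the number of clusters can be sketched with the Chinese restaurant process representation commonly used for Dirichlet process mixtures (an illustrative sketch, not the paper's full sampler; the function name and example sizes are assumptions). Under a CRP prior with concentration alpha, the prior probability that a row joins an existing cluster, or opens a new one, depends only on current cluster sizes, so no fixed number of clusters is ever specified:

```python
import numpy as np

def crp_prior_probs(cluster_sizes, alpha):
    """Prior probability that a new item joins each existing cluster,
    or (last slot) starts a brand-new cluster, under a CRP(alpha) prior."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    probs = np.append(sizes, alpha)   # existing clusters weighted by size; new cluster by alpha
    return probs / probs.sum()        # normalize over the K+1 options

# Example: three existing clusters of sizes 4, 2, 1 and alpha = 1.0,
# giving probabilities proportional to 4, 2, 1, and 1 (new cluster).
print(crp_prior_probs([4, 2, 1], alpha=1.0))
```

In a Gibbs sweep these prior weights are multiplied by the (collapsed) predictive likelihood of the row's data under each candidate cluster, which is what makes the number of clusters adapt to the data rather than being set in advance.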
Publication date: 2007